Results 1 - 20 of 47
1.
Front Bioinform ; 3: 1178926, 2023.
Article in English | MEDLINE | ID: mdl-37151482

ABSTRACT

Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins.
Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.

2.
J Fungi (Basel) ; 9(4)2023 Mar 29.
Article in English | MEDLINE | ID: mdl-37108879

ABSTRACT

In fungi, the most abundant transcription factor (TF) class contains a fungal-specific 'GAL4-like' Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as 'fungal_trans' or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these 'MHD-only' proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that MHD-only TFs are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6-MHD domain pair represents the canonical domain signature defining the predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of Zn2C6 TFs but will also provide critical guidance for future fungal gene regulatory network analyses.

3.
BMC Bioinformatics ; 22(1): 561, 2021 Nov 23.
Article in English | MEDLINE | ID: mdl-34814826

ABSTRACT

BACKGROUND: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. RESULTS: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89-92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. CONCLUSIONS: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.


Subject(s)
Algorithms , Neural Networks, Computer , Animals , Genome , Humans
4.
Biosystems ; 203: 104368, 2021 May.
Article in English | MEDLINE | ID: mdl-33567309

ABSTRACT

The X circular code is a set of 20 trinucleotides (codons) that has been identified in the protein-coding genes of most organisms (bacteria, archaea, eukaryotes, plasmids, viruses). It has been shown previously that the X circular code has the important mathematical property of being an error-correcting code. Thus, motifs of the X circular code, i.e. a series of codons belonging to X and called X motifs, allow identification and maintenance of the reading frame in genes. X motifs are significantly enriched in protein-coding genes, but have also been identified in many transfer RNA (tRNA) genes and in important functional regions of the ribosomal RNA (rRNA), notably in the peptidyl transferase center and the decoding center. Here, we investigate the potential role of X motifs as functional elements of protein-coding genes. First, we identify the codons of the X circular code which are frequent or rare in each domain of life (archaea, bacteria, eukaryota) and show that, for the amino acids with the highest codon bias, the preferred codon is often an X codon. We also observe a correlation between the 20 X codons and the optimal codons/dicodons that have been shown to influence translation efficiency. Then, we examined recently published experimental results concerning gene expression levels in diverse organisms. The approach used is the analysis of X motifs according to their density ds(X), i.e. the number of X motifs per kilobase in a gene sequence s. Surprisingly, this simple parameter identifies several unexpected relations between the X circular code and gene expression. For example, the X motifs are significantly enriched in the minimal gene set belonging to the three domains of life, and in codon-optimized genes. Furthermore, the density of X motifs generally correlates with experimental measures of translation efficiency and mRNA stability. 
Taken together, these results lead us to propose that the X motifs may represent a genetic signal contributing to the maintenance of the correct reading frame and the optimization and regulation of gene expression.
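
The density parameter ds(X) described in this abstract (number of X motifs per kilobase of a gene sequence s) can be illustrated with a minimal Python sketch. The abstract does not list the 20 trinucleotides or state a minimum motif length, so the set below is the X code as reported by Arquès and Michel (1996), and `min_codons`, `frame`, and the function names are assumptions made only for illustration, not the authors' implementation.

```python
# The 20 trinucleotides of the X circular code (Arquès and Michel, 1996).
X_CODE = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})


def find_x_motifs(seq, min_codons=4, frame=0):
    """Return (start, end) spans of maximal runs of consecutive X codons.

    min_codons is a hypothetical threshold, not a value from the abstract.
    """
    seq = seq.upper()
    motifs = []
    run_start = None
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in X_CODE:
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None and (pos - run_start) // 3 >= min_codons:
                motifs.append((run_start, pos))
            run_start = None
        pos += 3
    # Close a run that reaches the end of the sequence.
    if run_start is not None and (pos - run_start) // 3 >= min_codons:
        motifs.append((run_start, pos))
    return motifs


def x_motif_density(seq, min_codons=4, frame=0):
    """Number of X motifs per kilobase of sequence (a sketch of ds(X))."""
    if not seq:
        return 0.0
    return 1000.0 * len(find_x_motifs(seq, min_codons, frame)) / len(seq)
```

Scanning a single frame keeps the sketch short; a fuller analysis would presumably repeat the scan for each frame and tune the length threshold to the study design.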


Subject(s)
Codon/genetics , Gene Expression Regulation/genetics , Nucleotide Motifs/genetics , Genetic Code/genetics , Reading Frames , Ribosomes
5.
Genome Biol Evol ; 13(1)2021 01 07.
Article in English | MEDLINE | ID: mdl-33211099

ABSTRACT

In the multiomics era, comparative genomics studies based on gene repertoire comparison are increasingly used to investigate evolutionary histories of species, to study genotype-phenotype relations, species adaptation to various environments, or to predict gene function using phylogenetic profiling. However, comparisons of orthologs have highlighted the prevalence of sequence plasticity among species, showing the benefits of combining protein and subprotein levels of analysis to allow for a more comprehensive study of genotype/phenotype correlations. In this article, we introduce a new approach called BLUR (BLAST Unexpected Ranking), capable of detecting genotype divergence or specialization between two related clades at different levels: gain/loss of proteins but also of subprotein regions. These regions can correspond to known domains, uncharacterized regions, or even small motifs. Our method was created to allow two types of research strategies: 1) the comparison of two groups of species with no previous knowledge, with the aim of predicting phenotype differences or specializations between close species or 2) the study of specific phenotypes by comparing species that present the phenotype of interest with species that do not. We designed a website to facilitate the use of BLUR with a possibility of in-depth analysis of the results with various tools, such as functional enrichments, protein-protein interaction networks, and multiple sequence alignments. We applied our method to the study of two different biological pathways and to the comparison of several groups of close species, all with very promising results. BLUR is freely available at http://lbgi.fr/blur/.


Subject(s)
Evolution, Molecular , Genomics/methods , Proteins/genetics , Proteome/genetics , Proteome/metabolism , Animals , Armadillo Domain Proteins , Bacteria , Conserved Sequence/genetics , Fungi , Genotype , Humans , Phenotype , Phylogeny , Sequence Alignment , Sequence Analysis , Software
6.
BMC Bioinformatics ; 21(1): 513, 2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33172385

ABSTRACT

BACKGROUND: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. RESULTS: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. CONCLUSIONS: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.


Subject(s)
Open Reading Frames/genetics , Primates/metabolism , Proteome , Amino Acid Sequence , Animals , Databases, Protein , Gene Deletion , Humans , Mutagenesis, Insertional , Receptor-Like Protein Tyrosine Phosphatases/chemistry , Receptor-Like Protein Tyrosine Phosphatases/genetics , Receptor-Like Protein Tyrosine Phosphatases/metabolism , Sequence Alignment
7.
BMC Genomics ; 21(1): 293, 2020 Apr 09.
Article in English | MEDLINE | ID: mdl-32272892

ABSTRACT

BACKGROUND: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. RESULTS: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools. CONCLUSIONS: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.


Subject(s)
Computational Biology/methods , Eukaryota/genetics , Molecular Sequence Annotation/methods , Animals , Data Curation , Evolution, Molecular , Humans , Phylogeny
8.
Biosystems ; 195: 104134, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32251681

ABSTRACT

The standard genetic code (SGC) describes how 64 trinucleotides (codons) encode 20 amino acids and the stop translation signal. Biochemical and statistical studies have shown that the standard genetic code is optimized to reduce the impact of errors caused by incorporation of wrong amino acids during translation. This is achieved by mapping codons that differ by only one nucleotide to the same amino acid or one with similar biochemical properties, so that if misincorporation occurs, the structure and function of the translated protein remain relatively unaltered. Some previous studies have extended the analysis of SGC optimality to the effect of frameshift errors on the conservation of amino acids. Here, we compare the optimality of the SGC with a set of circular codes, and in particular the X circular code identified in genes, on the basis of various biochemical properties over all possible frameshift errors. We show that the X circular code is more optimized to minimize the impact of frameshift errors than the SGC for the chosen amino acid properties. Furthermore, in the context of a problem that has been unresolved since 1996, we also demonstrate that the X circular code has a frameshift optimality in its combinatorial class of 216 maximal self-complementary C3 circular codes. To our knowledge, this is the first demonstration of the role of the X circular code in mitigation of translation errors. These results lead us to discuss the potential role of the X circular code in the evolution of the standard genetic code.


Subject(s)
Amino Acid Substitution/genetics , Evolution, Molecular , Frameshift Mutation/genetics , Genetic Code/genetics , Codon , Mutation, Missense , Reading Frames
9.
RNA Biol ; 17(4): 571-583, 2020 04.
Article in English | MEDLINE | ID: mdl-31960748

ABSTRACT

Three-base periodicity (TBP), where nucleotides and higher order n-tuples are preferentially spaced by 3, 6, 9, etc. bases, is a well-known intrinsic property of protein-coding DNA sequences. However, its origins are still not fully understood. One hypothesis is that the periodicity reflects a primordial coding system that was used before the emergence of the modern standard genetic code (SGC). Recent evidence suggests that the X circular code, a set of 20 trinucleotides allowing the reading frames in genes to be retrieved locally, represents a possible ancestor of the SGC. Motifs from the X circular code have been found in the reading frame of protein-coding regions in extant organisms from bacteria to eukaryotes, in many transfer RNA (tRNA) genes and in important functional regions of the ribosomal RNA (rRNA), notably in the peptidyl transferase centre and the decoding centre. Here, we have used a powerful correlation function to search for periodicity patterns involving the 20 trinucleotides of the X circular code in a large set of bacterial protein-coding genes, as well as in the translation machinery, including rRNA and tRNA sequences. As might be expected, we found a strong circular code periodicity 0 modulo 3 in the protein-coding genes. More surprisingly, we also identified a similar circular code periodicity in a large region of the 16S rRNA. This region includes the 3' major domain corresponding to the primordial proto-ribosome decoding centre and containing numerous sites that interact with the tRNA and messenger RNA (mRNA) during translation. Furthermore, 3D structural analysis shows that the periodicity region surrounds the mRNA channel that lies between the head and the body of the SSU. Our results support the hypothesis that the X circular code may constitute an ancestral translation code involved in reading frame retrieval and maintenance, traces of which persist in modern mRNA, tRNA and rRNA despite their long evolution and adaptation to the SGC.


Subject(s)
Bacteria/genetics , Bacterial Proteins/genetics , Computational Biology/methods , Ribosomes/genetics , Algorithms , Bacteria/metabolism , Evolution, Molecular , Genetic Code , Periodicity , RNA, Bacterial/genetics , RNA, Ribosomal/genetics , RNA, Transfer/genetics
10.
RNA ; 25(12): 1714-1730, 2019 12.
Article in English | MEDLINE | ID: mdl-31506380

ABSTRACT

The origin of the genetic code remains enigmatic five decades after it was elucidated, although there is growing evidence that the code coevolved progressively with the ribosome. A number of primordial codes were proposed as ancestors of the modern genetic code, including comma-free codes such as the RRY, RNY, or GNC codes (R = G or A, Y = C or T, N = any nucleotide), and the X circular code, an error-correcting code that also allows identification and maintenance of the reading frame. It was demonstrated previously that motifs of the X circular code are significantly enriched in the protein-coding genes of most organisms, from bacteria to eukaryotes. Here, we show that imprints of this code also exist in the ribosomal RNA (rRNA). In a large-scale study involving 133 organisms representative of the three domains of life, we identified 32 universal X motifs that are conserved in the rRNA of >90% of the organisms. Intriguingly, most of the universal X motifs are located in rRNA regions involved in important ribosome functions, notably in the peptidyl transferase center and the decoding center that form the original "proto-ribosome." Building on the existing accretion models for ribosome evolution, we propose that error-correcting circular codes represented an important step in the emergence of the modern genetic code. Thus, circular codes would have allowed the simultaneous coding of amino acids and synchronization of the reading frame in primitive translation systems, prior to the emergence of more sophisticated start codon recognition and translation initiation mechanisms.
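
The comma-free codes named in this abstract (RRY, RNY, GNC, with R = G or A, Y = C or T, N = any nucleotide) can be enumerated and checked by brute force. The following Python sketch is an illustration only; the helper names `build_code` and `is_comma_free` are hypothetical, and the check uses the standard definition of comma-freeness (no codeword may appear in a shifted frame of two concatenated codewords).

```python
from itertools import product

# Nucleotide classes as defined in the abstract.
R, Y, N = "AG", "CT", "ACGT"


def build_code(pattern):
    """Enumerate trinucleotides matching a 3-symbol pattern such as 'RNY' or 'GNC'."""
    classes = {"R": R, "Y": Y, "N": N}
    # Symbols that are not class letters (e.g. 'G', 'C') stand for themselves.
    choices = [classes.get(symbol, symbol) for symbol in pattern]
    return {"".join(t) for t in product(*choices)}


def is_comma_free(code):
    """True if no codeword occurs in frame 1 or 2 of any pair of concatenated codewords."""
    for a, b in product(code, repeat=2):
        pair = a + b
        if pair[1:4] in code or pair[2:5] in code:
            return False
    return True
```

For example, `build_code("RNY")` yields 16 trinucleotides, and `is_comma_free` accepts all three patterns, because a shifted window over two RNY-type codewords always starts with a pyrimidine or ends with a purine.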


Subject(s)
Evolution, Molecular , Genetic Code , Nucleotide Motifs , Protein Biosynthesis , Ribosomes/genetics , Ribosomes/metabolism , Models, Biological , Models, Molecular , Molecular Conformation , Nucleic Acid Conformation , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , Ribosomes/chemistry , Structure-Activity Relationship
11.
Nucleic Acids Res ; 47(D1): D411-D418, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30380106

ABSTRACT

OrthoInspector is one of the leading software suites for orthology relations inference. In this paper, we describe a major redesign of the OrthoInspector online resource along with a significant increase in the number of species: 4753 organisms are now covered across the three domains of life, making OrthoInspector the most exhaustive orthology resource to date in terms of covered species (excluding viruses). The new website integrates original data exploration and visualization tools in an ergonomic interface. Distributions of protein orthologs are represented by heatmaps summarizing their evolutionary histories, and proteins with similar profiles can be directly accessed. Two novel tools have been implemented for comparative genomics: a phylogenetic profile search that can be used to find proteins with a specific presence-absence profile and investigate their functions and, inversely, a GO profiling tool aimed at deciphering evolutionary histories of molecular functions, processes or cell components. In addition to the re-designed website, the OrthoInspector resource now provides a REST interface for programmatic access. OrthoInspector 3.0 is available at http://lbgi.fr/orthoinspectorv3.


Subject(s)
Databases, Genetic , Genomics , Algorithms , Bacteria/genetics , Classification , Eukaryota/genetics , Evolution, Molecular , Forecasting , Gene Ontology , Internet , Phylogeny , Proteome , Sequence Homology, Nucleic Acid , Software , Species Specificity
12.
Biosystems ; 175: 57-74, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30367916

ABSTRACT

A set X of 20 trinucleotides has been found to have the highest average occurrence in the reading frame, compared to the two shifted frames, of genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel, 2015, 2017; Arquès and Michel, 1996). This set X has an interesting mathematical property, since X is a maximal C3 self-complementary trinucleotide circular code (Arquès and Michel, 1996). Furthermore, any motif obtained from this circular code X has the capacity to retrieve, maintain and synchronize the reading frame in genes. In a recent study of the X motifs in the complete genome of the yeast, Saccharomyces cerevisiae, it was shown that they are significantly enriched in the reading frame of the genes (protein-coding regions) of the genome (Michel et al., 2017). It was suggested that these X motifs may be evolutionary relics of a primitive code originally used for gene translation. The aim of this paper is to address two questions: are X motifs conserved during evolution? and do they continue to play a functional role in the processes of genome decoding and protein production? In a large scale analysis involving complete genomes from four mammals and nine different yeast species, we highlight specific evolutionary pressures on the X motifs in the genes of all the genomes, and identify important new properties of X motif conservation at the level of the encoded amino acids. We then compare the occurrence of X motifs with existing experimental data concerning protein expression and protein production, and report a significant correlation between the number of X motifs in a gene and increased protein abundance. In a general way, this work suggests that motifs from circular codes, i.e. motifs having the property of reading frame retrieval, may represent functional elements located within the coding regions of extant genomes.


Subject(s)
Algorithms , Eukaryota/genetics , Evolution, Molecular , Genetic Code , Genome , Models, Genetic , Nucleotide Motifs , Animals , Base Sequence , Eukaryota/physiology , Sequence Homology
13.
Bioinformatics ; 34(19): 3390-3392, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29741582

ABSTRACT

Summary: Comparative studies of protein sequences are widely used in evolutionary and comparative genomics studies, but there is a lack of efficient tools to identify conserved regions ab initio within a protein multiple alignment. PROBE provides a fully automatic analysis of protein family conservation, to identify conserved regions, or 'blocks', that may correspond to structural/functional domains or motifs. Conserved blocks are identified at two different levels: (i) family level blocks indicate sites that are probably of central importance to the protein's structure or function, and (ii) sub-family level blocks highlight regions that may signify functional specialization, such as binding partners, etc. All conserved blocks are mapped onto a phylogenetic tree and can also be visualized in the context of the multiple sequence alignment. PROBE thus facilitates in-depth studies of sequence-structure-function-evolution relationships, and opens the way to block-level phylogenetic profiling. Availability and implementation: Freely available on the web at http://www.lbgi.fr/~julie/probe/web.


Subject(s)
Evolution, Molecular , Proteins/genetics , Software , Amino Acid Sequence , Computational Biology , Conserved Sequence , Phylogeny , Sequence Alignment
14.
Life (Basel) ; 7(4)2017 Dec 03.
Article in English | MEDLINE | ID: mdl-29207500

ABSTRACT

A set X of 20 trinucleotides has been found to have the highest average occurrence in the reading frame, compared to the two shifted frames, of genes of bacteria, archaea, eukaryotes, plasmids and viruses. This set X has an interesting mathematical property, since X is a maximal C3 self-complementary trinucleotide circular code. Furthermore, any motif obtained from this circular code X has the capacity to retrieve, maintain and synchronize the original (reading) frame. Since 1996, the theory of circular codes in genes has mainly been developed by analysing the properties of the 20 trinucleotides of X, using combinatorics and statistical approaches. For the first time, we test this theory by analysing the X motifs, i.e., motifs from the circular code X, in the complete genome of the yeast Saccharomyces cerevisiae. Several properties of X motifs are identified by basic statistics (at the frequency level), and evaluated by comparison to R motifs, i.e., random motifs generated from 30 different random codes R. We first show that the frequency of X motifs is significantly greater than that of R motifs in the genome of S. cerevisiae. We then verify that no significant difference is observed between the frequencies of X and R motifs in the non-coding regions of S. cerevisiae, but that the occurrence number of X motifs is significantly higher than R motifs in the genes (protein-coding regions). This property is true for all cardinalities of X motifs (from 4 to 20) and for all 16 chromosomes. We further investigate the distribution of X motifs in the three frames of S. cerevisiae genes and show that they occur more frequently in the reading frame, regardless of their cardinality or their length. Finally, the ratio of X genes, i.e., genes with at least one X motif, to non-X genes, in the set of verified genes is significantly different to that observed in the set of putative or dubious genes with no experimental evidence. 
These results, taken together, represent the first evidence for a significant enrichment of X motifs in the genes of an extant organism. They raise two hypotheses: the X motifs may be evolutionary relics of the primitive codes used for translation, or they may continue to play a functional role in the complex processes of genome decoding and protein synthesis.
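
Two of the properties stated for X in this abstract, self-complementarity and maximality, can be checked directly on the set of 20 trinucleotides. The Python sketch below assumes the X code as reported by Arquès and Michel (1996), since the abstract itself does not list it; the function names are hypothetical. The partition test is only a necessary condition for a maximal circular code (X and its two circular shifts must be disjoint and together cover the 60 trinucleotides other than AAA, CCC, GGG and TTT); it is not a proof of circularity.

```python
from itertools import product

# The 20 trinucleotides of the X circular code (Arquès and Michel, 1996).
X_CODE = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(trinucleotide):
    """Return the reverse complement, e.g. AAC -> GTT."""
    return trinucleotide.translate(COMPLEMENT)[::-1]


def shift(trinucleotide, k):
    """Circularly shift a trinucleotide by k positions, e.g. shift('AAC', 1) -> 'ACA'."""
    return trinucleotide[k:] + trinucleotide[:k]


def is_self_complementary(code):
    """True if the reverse complement of every codeword is also a codeword."""
    return all(reverse_complement(t) in code for t in code)


def shifted_codes_partition(code):
    """True if the code and its two circular shifts are disjoint and cover
    all 60 trinucleotides other than AAA, CCC, GGG and TTT."""
    shifted = [{shift(t, k) for t in code} for k in range(3)]
    union = set().union(*shifted)
    all_trinucleotides = {"".join(p) for p in product("ACGT", repeat=3)}
    non_periodic = all_trinucleotides - {"AAA", "CCC", "GGG", "TTT"}
    return sum(len(s) for s in shifted) == len(union) and union == non_periodic
```

Both `is_self_complementary(X_CODE)` and `shifted_codes_partition(X_CODE)` hold for the set above, which is consistent with a code of 20 words being the largest possible size for a trinucleotide circular code.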

15.
Mol Biol Evol ; 34(8): 2016-2034, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28460059

ABSTRACT

Cilia (flagella) are important eukaryotic organelles, present in the Last Eukaryotic Common Ancestor, and are involved in cell motility and integration of extracellular signals. Ciliary dysfunction causes a class of genetic diseases known as ciliopathies; however, current knowledge of the underlying mechanisms is still limited and a better characterization of genes is needed. As cilia have been lost independently several times during evolution and they are subject to important functional variation between species, ciliary genes can be investigated through comparative genomics. We performed phylogenetic profiling by predicting orthologs of human protein-coding genes in 100 eukaryotic species. The analysis integrated three independent methods to predict a consensus set of 274 ciliary genes, including 87 new promising candidates. A fine-grained analysis of the phylogenetic profiles allowed a partitioning of ciliary genes into modules with distinct evolutionary histories and ciliary functions (assembly, movement, centriole, etc.) and thus propagation of potential annotations to previously undocumented genes. The cilia/basal body localization was experimentally confirmed for five of these previously unannotated proteins (LRRC23, LRRC34, TEX9, WDR27, and BIVM), validating the relevance of our approach. Furthermore, our multi-level analysis sheds light on the core gene sets retained in gamete-only flagellates or Ecdysozoa, for instance. By combining gene-centric and species-oriented analyses, this work reveals new ciliary and ciliopathy gene candidates and provides clues about the evolution of ciliary processes in the eukaryotic domain. Additionally, the positive and negative reference gene sets and the phylogenetic profile of human genes constructed during this study can be exploited in future work.
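
Phylogenetic profiling, as used in this abstract, represents each gene by its presence or absence across a set of species and groups genes with similar profiles. The abstract does not specify the three prediction methods that were integrated, so the Python sketch below only shows the generic idea, using Jaccard similarity as an assumed measure; all function names and any gene or species labels passed to it are hypothetical.

```python
def build_profile(present_in, species):
    """Binary presence/absence vector of a gene over an ordered species list."""
    return tuple(1 if s in present_in else 0 for s in species)


def jaccard(p, q):
    """Jaccard similarity between two binary profiles of equal length."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_by_similarity(query, profiles):
    """Rank all other genes by profile similarity to the query gene, best first."""
    scored = [
        (jaccard(profiles[query], profile), gene)
        for gene, profile in profiles.items()
        if gene != query
    ]
    return sorted(scored, reverse=True)
```

A gene whose profile matches that of known ciliary genes (present in ciliated species, absent where cilia were lost) would rank high in such a list, which is the intuition behind propagating annotations to uncharacterized genes.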


Subject(s)
Cilia/genetics , Ciliopathies/genetics , Animals , Cell Movement/genetics , Cilia/metabolism , Ciliopathies/metabolism , Databases, Nucleic Acid , Eukaryota , Eukaryotic Cells , Evolution, Molecular , Flagella/genetics , Flagella/metabolism , Genomics , Humans , Phylogeny , Sequence Analysis, DNA/methods
16.
BMC Bioinformatics ; 17(1): 271, 2016 Jul 07.
Article in English | MEDLINE | ID: mdl-27387560

ABSTRACT

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. RESULTS: Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including 'core blocks', 'regions' and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. CONCLUSIONS: LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.


Subject(s)
Bayes Theorem , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Humans , Proteins/genetics , Sequence Homology, Amino Acid
17.
Bioinformatics ; 31(3): 447-8, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25273105

ABSTRACT

SUMMARY: We previously developed OrthoInspector, a package incorporating an original algorithm for the detection of orthology and inparalogy relations between different species. We have added new functionalities to the package. While its original algorithm was not modified, performing similar orthology predictions, we facilitated the prediction of very large databases (thousands of proteomes), refurbished its graphical interface, added new visualization tools for comparative genomics/protein family analysis and facilitated its deployment in a network environment. Finally, we have released three online databases of precomputed orthology relationships. AVAILABILITY: Package and databases are freely available at http://lbgi.fr/orthoinspector with all major browsers supported. CONTACT: odile.lecompte@unistra.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computer Graphics , Databases, Factual , Proteomics/methods , Sequence Analysis, Protein/methods , Software , Humans , Molecular Sequence Annotation , Phylogeny
18.
Nutrients ; 7(1): 1-16, 2014 Dec 24.
Article in English | MEDLINE | ID: mdl-25545100

ABSTRACT

Essential amino acids (EAA) consist of a group of nine amino acids that animals are unable to synthesize via de novo pathways. Recently, it has been found that most metazoans lack the same set of enzymes responsible for de novo EAA biosynthesis. Here we investigate the sequence conservation and evolution of all the remaining metazoan genes in EAA pathways. Initially, the set of all 49 enzymes responsible for de novo EAA biosynthesis in yeast was retrieved. These enzymes were used as BLAST queries to search for similar sequences in a database containing 10 complete metazoan genomes. Eight enzymes typically attributed to EAA pathways were found to be ubiquitous in metazoan genomes, suggesting a conserved functional role. In this study, we address the question of how these genes evolved after losing their pathway partners. To do this, we compared metazoan genes with their fungal and plant orthologs. Using phylogenetic analysis with maximum likelihood, we found that acetolactate synthase (ALS) and betaine-homocysteine S-methyltransferase (BHMT) diverged from the expected Tree of Life (ToL) relationships. High sequence conservation in the paraphyletic group Plant-Fungi was identified for these two genes using a newly developed Python algorithm. Selective pressure analysis of ALS and BHMT protein sequences showed higher non-synonymous mutation ratios in comparisons between metazoans/fungi and metazoans/plants, supporting the hypothesis that these two genes have undergone non-ToL evolution in animals.


Subject(s)
Amino Acids, Essential/biosynthesis , Conserved Sequence/genetics , Acetolactate Synthase/genetics , Acetolactate Synthase/metabolism , Amino Acid Sequence , Animals , Betaine-Homocysteine S-Methyltransferase/genetics , Betaine-Homocysteine S-Methyltransferase/metabolism , Biological Evolution , Fungi/enzymology , Fungi/genetics , Humans , Phylogeny , Plants/enzymology , Plants/genetics , Saccharopine Dehydrogenases/genetics , Saccharopine Dehydrogenases/metabolism
19.
Bioinformatics ; 30(17): 2432-9, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24825613

ABSTRACT

MOTIVATION: The prediction of protein-coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a large number of inconsistencies, due to both natural variants and sequence prediction errors. RESULTS: We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than that of previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION: Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
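
The idea of estimating amino acid probabilities in an alignment column under a Bayesian prior can be illustrated with a much simpler model than the one the abstract describes. The sketch below is not the SIBIS implementation: it assumes a single symmetric Dirichlet prior (pseudocount `alpha`) instead of a Dirichlet mixture, and the function name `posterior_probs` and the example column are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def posterior_probs(column, alpha=1.0):
    """Posterior mean probability of each amino acid in one alignment column.

    Assumption: a single symmetric Dirichlet prior with pseudocount `alpha`
    per residue type (a simplification of a Dirichlet mixture model).
    Gap characters and other non-standard symbols are ignored.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + alpha * len(AMINO_ACIDS)
    # Posterior mean of a Dirichlet-multinomial: (n_i + alpha) / (N + K * alpha)
    return {aa: (counts[aa] + alpha) / total for aa in AMINO_ACIDS}


# Hypothetical column: nine leucines, one valine and one gap.
probs = posterior_probs("LLLLLLLLLV-")
for residue in "LVW":
    print(residue, round(probs[residue], 3))
```

Under this simplified model, a residue with a low posterior probability in an otherwise conserved column is a candidate inconsistency; a real detector would also need to combine evidence across neighbouring columns to flag whole segments.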


Subject(s)
Sequence Analysis, Protein/methods , Algorithms , Animals , Bayes Theorem , Databases, Protein , Humans , Macaca mulatta , Phylogeny , Sequence Alignment , Software
20.
BMC Bioinformatics ; 15: 111, 2014 Apr 17.
Article in English | MEDLINE | ID: mdl-24742296

ABSTRACT

BACKGROUND: Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides leading to the insertion/deletion of one or more amino acids. RESULTS: In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. CONCLUSIONS: We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.
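
The distinction between frameshifting and non-frameshifting Indels described above comes down to whether the net length change is a multiple of three nucleotides. A minimal sketch follows, assuming the variant is given as reference and alternate allele strings within a coding region; the function name `classify_indel` is hypothetical and is not part of KD4i.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding-region indel as frameshifting (FS) or non-frameshifting (NFS).

    Assumption: `ref` and `alt` are the reference and alternate alleles as
    nucleotide strings, and the variant lies entirely within a coding region.
    """
    delta = abs(len(alt) - len(ref))
    if delta == 0:
        raise ValueError("ref and alt have the same length: not an indel")
    # A net change of 3, 6, 9, ... nucleotides preserves the reading frame.
    return "NFS" if delta % 3 == 0 else "FS"


print(classify_indel("A", "ATTT"))  # insertion of three nucleotides
print(classify_indel("ATT", "A"))   # deletion of two nucleotides
```

This check only considers allele lengths; it does not model splice-site effects or the position of the variant relative to codon boundaries.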


Subject(s)
Artificial Intelligence , INDEL Mutation , Phenotype , Proteins/genetics , Humans , Kinesins/genetics